Importing Libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px

Loading User Dataset¶

In [2]:
user_df = pd.read_csv('USER_TAKEHOME.csv')
user_df.shape
user_df
Out[2]:
ID CREATED_DATE BIRTH_DATE STATE LANGUAGE GENDER
0 5ef3b4f17053ab141787697d 2020-06-24 20:17:54.000 Z 2000-08-11 00:00:00.000 Z CA es-419 female
1 5ff220d383fcfc12622b96bc 2021-01-03 19:53:55.000 Z 2001-09-24 04:00:00.000 Z PA en female
2 6477950aa55bb77a0e27ee10 2023-05-31 18:42:18.000 Z 1994-10-28 00:00:00.000 Z FL es-419 female
3 658a306e99b40f103b63ccf8 2023-12-26 01:46:22.000 Z NaN NC en NaN
4 653cf5d6a225ea102b7ecdc2 2023-10-28 11:51:50.000 Z 1972-03-19 00:00:00.000 Z PA en female
... ... ... ... ... ... ...
99995 61fc06d41febf771966da8fa 2022-02-03 16:46:12.000 Z 1992-03-16 08:00:00.000 Z CA en female
99996 6391e7ef90ad5449ec5f782d 2022-12-08 13:34:39.000 Z 1993-09-23 05:00:00.000 Z MO en female
99997 637d5efdd6f2a49c49934dcb 2022-11-22 23:45:05.000 Z 1983-04-19 00:00:00.000 Z RI en female
99998 5f0de23b05d8a6147dc0cafa 2020-07-14 16:50:04.000 Z 1995-06-09 04:00:00.000 Z DE en female
99999 5e7ab436905e9512ff44abc8 2020-03-25 01:30:30.000 Z 1995-12-15 05:00:00.000 Z VA NaN female

100000 rows × 6 columns

In [3]:
#Familiarizing with Dataset
user_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   ID            100000 non-null  object
 1   CREATED_DATE  100000 non-null  object
 2   BIRTH_DATE    96325 non-null   object
 3   STATE         95188 non-null   object
 4   LANGUAGE      69492 non-null   object
 5   GENDER        94108 non-null   object
dtypes: object(6)
memory usage: 4.6+ MB
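The `info()` output shows that both date columns load as plain `object` strings (with a trailing `" Z"`). A minimal sketch, on synthetic rows mimicking that format, of converting them to proper datetimes so later age or tenure calculations become possible:

```python
import pandas as pd

# Synthetic rows imitating the USER_TAKEHOME.csv date format
df = pd.DataFrame({
    'CREATED_DATE': ['2020-06-24 20:17:54.000 Z', '2021-01-03 19:53:55.000 Z'],
    'BIRTH_DATE': ['2000-08-11 00:00:00.000 Z', None],
})
for col in ['CREATED_DATE', 'BIRTH_DATE']:
    # errors='coerce' turns unparseable values (and missing ones) into NaT
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S.%f Z',
                             errors='coerce')

print(df.dtypes)
```

With the columns as `datetime64`, comparisons and date arithmetic (e.g. deriving user age from `BIRTH_DATE`) work directly.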

Checking the Dataset for the Missing and Duplicated Values.¶

In [4]:
#Checking for Null Values
user_df.isnull().sum()
Out[4]:
ID                  0
CREATED_DATE        0
BIRTH_DATE       3675
STATE            4812
LANGUAGE        30508
GENDER           5892
dtype: int64
In [5]:
# Calculate the percentage of null values for each column
null_percentage = (user_df.isnull().sum() / len(user_df)) * 100

#Percentage of Blank Values
print(null_percentage)
ID               0.000
CREATED_DATE     0.000
BIRTH_DATE       3.675
STATE            4.812
LANGUAGE        30.508
GENDER           5.892
dtype: float64

Based on the above exploration, we can see that about 30% of the values in the LANGUAGE column are blank/null. This is a very high percentage and can be considered a data quality issue, since it is a hurdle to understanding the user demographics: preferred language is an important factor.
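The threshold for "too many nulls" is a judgment call; a minimal sketch, on hypothetical data, of flagging any column whose null share exceeds a chosen cutoff (20% here) for review before demographic analysis:

```python
import pandas as pd

# Hypothetical frame: LANGUAGE is 60% null, STATE 20%, ID 0%
df = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'LANGUAGE': ['en', None, None, 'es-419', None],
    'STATE': ['CA', 'TX', None, 'NY', 'FL'],
})

# isnull().mean() gives the null fraction per column in one step
null_pct = df.isnull().mean() * 100
flagged = null_pct[null_pct > 20].index.tolist()
print(flagged)  # columns needing attention before analysis
```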

In [6]:
#Checking for duplicated rows, since each user should be unique
user_df.duplicated().sum()
Out[6]:
0

None of the rows are duplicated in this dataset.
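Note that `duplicated()` compares whole rows; since IDs should be unique on their own, it is also worth checking duplicates on the ID column alone. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical rows: same ID appears twice with different states,
# so no row is a full duplicate, but the ID is
df = pd.DataFrame({
    'ID': ['u1', 'u2', 'u2'],
    'STATE': ['CA', 'TX', 'NY'],
})
print(df.duplicated().sum())             # 0 full-row duplicates
print(df.duplicated(subset='ID').sum())  # 1 duplicate ID
```

A clean `duplicated().sum()` therefore does not by itself guarantee ID uniqueness.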

Studying the Statewise User Demographics¶

In [7]:
#Checking the unique values of State Column
user_df['STATE'].unique()
Out[7]:
array(['CA', 'PA', 'FL', 'NC', 'NY', 'IN', nan, 'OH', 'TX', 'NM', 'PR',
       'CO', 'AZ', 'RI', 'MO', 'NJ', 'MA', 'TN', 'LA', 'NH', 'WI', 'IA',
       'GA', 'VA', 'DC', 'KY', 'SC', 'MN', 'WV', 'DE', 'MI', 'IL', 'MS',
       'WA', 'KS', 'CT', 'OR', 'UT', 'MD', 'OK', 'NE', 'NV', 'AL', 'AK',
       'AR', 'HI', 'ME', 'ND', 'ID', 'WY', 'MT', 'SD', 'VT'], dtype=object)
In [8]:
#User counts for each state
state_counts = user_df['STATE'].value_counts()

#Converting state_counts Series to a DataFrame
state_counts_df = state_counts.reset_index()
state_counts_df.columns = ['STATE', 'User Count']
state_counts_df['Percentage'] = (state_counts_df['User Count'] / state_counts_df['User Count'].sum()) * 100
state_counts_df['Percentage'] = state_counts_df['Percentage'].apply(lambda x: f'{x:.2f}%')

#Plotting a Bar chart for User Count vs State
fig = px.bar(state_counts_df, x='STATE', y='User Count',
             height=600, width=1300, hover_data={'Percentage': True})

fig.update_layout(
    title={
        'text': "Users distribution based on States",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="States", 
    yaxis_title="User Count"
)

fig.update_xaxes(tickangle=-40)
fig.show()

Based on the above exploration, the majority of Fetch's users are from TX, FL, CA, and NY.¶

Studying Consumer Behavior based on Gender¶

In [9]:
#User counts based on Gender
user_df['GENDER'].value_counts()
Out[9]:
female                    64240
male                      25829
transgender                1772
prefer_not_to_say          1350
non_binary                  473
unknown                     196
not_listed                  180
Non-Binary                   34
not_specified                28
My gender isn't listed        5
Prefer not to say             1
Name: GENDER, dtype: int64

Here, for the GENDER column, we see a lot of inconsistency in the values. This can be considered a data quality issue. To better understand the gender of Fetch's users, we can group similar values together so that only the main gender categories remain.

In [10]:
#Grouping similar categories together and visually representing users by gender
gender_mapping = {
    'female': 'Female',
    'male': 'Male',
    'transgender': 'Transgender',
    'non_binary': 'Non Binary',
    'Non-Binary': 'Non Binary',
    'prefer_not_to_say': 'Prefer not to Say',
    'unknown': 'Unknown',
    'not_listed': 'Unknown',
    'not_specified': 'Unknown',
    'My gender isn\'t listed': 'Unknown',
    'Prefer not to say': 'Prefer not to Say'
}

#Creating a dataframe which has User counts for each main gender category
grouped_gender_df = (user_df['GENDER']
              .map(gender_mapping)
              .value_counts()
              .reset_index()
              .rename(columns={'index': 'Gender_Group', 'GENDER': 'User Count'}))

#Creating the pie chart for Gender 
fig = px.pie(grouped_gender_df, names='Gender_Group', values='User Count', 
             height=600, width=900)


fig.update_layout(
    title={
        'text': "Users Distribution based on Gender",
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    }
)
# Show the plot
fig.show()

From the above pie chart, we can see that Females are Fetch's major user base.¶

Understanding Users based on their Preferred Language¶

In [11]:
#Replacing blank strings with NaN (assigning back rather than using inplace=True on a column, which is deprecated)
user_df['LANGUAGE'] = user_df['LANGUAGE'].replace('', np.nan)
print('The total language records that are not Null: ', user_df['LANGUAGE'].value_counts().sum())

#Getting the percentage of Users based on their preferred Language
lang_counts = (user_df['LANGUAGE'].value_counts() / user_df['LANGUAGE'].value_counts().sum()) * 100
print(lang_counts.apply(lambda x: f'{x:.0f}%'))
The total language records that are not Null:  69492
en        91%
es-419     9%
Name: LANGUAGE, dtype: object

We can see above that, of the non-null values, 91% of users have English as their preferred language.
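As an aside, `value_counts(normalize=True)` computes the same shares in a single call, ignoring NaN by default. A minimal sketch on synthetic LANGUAGE values:

```python
import pandas as pd

# Synthetic series: 9 'en', 1 'es-419', 4 missing
lang = pd.Series(['en'] * 9 + ['es-419'] * 1 + [None] * 4, name='LANGUAGE')

# normalize=True returns fractions over the non-null values only
shares = lang.value_counts(normalize=True) * 100
print(shares.round(0))
```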

Final Thoughts:¶

The major data quality issues found in this dataset:

1) Assuming the LANGUAGE column holds each user's preferred language, 30% of its values are missing. This is a major data quality issue, as it will hinder user personalization and targeted marketing.

2) GENDER has a lot of inconsistent values. To make proper sense of the data and use it for future analysis, we will need to group the values into main categories (similar to what was done for the pie chart) to better understand Fetch's consumer base. A small percentage of GENDER values are also missing.

Minor Data Quality issues found in this dataset:

1) A small percentage of values are missing in the STATE and BIRTH_DATE columns. These will have to be handled either by imputing values or by removing the rows, depending on how the columns are used in the analysis.
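The two handling options just mentioned can be sketched on synthetic data (column names match the dataset; the 'Unknown' placeholder is an illustrative choice, not a recommendation):

```python
import pandas as pd

# Synthetic frame with one missing STATE and one missing BIRTH_DATE
df = pd.DataFrame({
    'STATE': ['CA', None, 'TX'],
    'BIRTH_DATE': ['1990-01-01', '1985-05-05', None],
})

# Option 1: impute STATE with a placeholder category
imputed = df.assign(STATE=df['STATE'].fillna('Unknown'))

# Option 2: drop rows where BIRTH_DATE is missing
dropped = df.dropna(subset=['BIRTH_DATE'])

print(imputed['STATE'].tolist(), len(dropped))
```

Which option fits depends on the downstream use: imputation preserves row counts for aggregate statistics, while dropping rows is safer for age-based analysis where a fabricated birth date would bias results.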

In [ ]: